[NPUW]Optimize MoE (GPT-OSS-20B) TPS on NPU - DEVICE_ROUTED.#33847

Open
intelgaoxiong wants to merge 3 commits into openvinotoolkit:master from intelgaoxiong:xiong/gpt-oss_device_routed

@intelgaoxiong intelgaoxiong commented Jan 28, 2026

Details:

Background:
#33372 implemented HOST_ROUTED processing for MoE decoding.
However, its submission overhead limits decoding throughput.

Optimization:
This PR improves MoE TPS with DEVICE_ROUTED processing:

  • Expert selection is performed dynamically on the device using Gather operations, which avoids graph splitting and reduces host-device overhead.
  • Inference execution is the same as for a traditional LLM.

TPS improves from 12 t/s to 17.9 t/s.
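
As a rough illustration of the gather-based selection described above (a CPU stand-in, not the code in this PR), the sketch below picks the top-k experts from router scores and gathers their rows from a weight table by index. The function names `top_k_experts` and `gather_experts` and the `[num_experts][hidden]` layout are assumptions made for the example only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative only: hypothetical helpers, not part of NPUW or OpenVINO.

// Return the indices of the k largest router scores, highest score first.
std::vector<std::size_t> top_k_experts(const std::vector<float>& scores, std::size_t k) {
    assert(k <= scores.size());
    std::vector<std::size_t> idx(scores.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    // Order only the first k positions by descending score.
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                      [&scores](std::size_t a, std::size_t b) { return scores[a] > scores[b]; });
    idx.resize(k);
    return idx;
}

// Gather along the expert axis: copy the selected rows of an assumed
// [num_experts][hidden] weight table. The table itself stays whole, so
// nothing has to be split per expert.
std::vector<std::vector<float>> gather_experts(const std::vector<std::vector<float>>& weights,
                                               const std::vector<std::size_t>& indices) {
    std::vector<std::vector<float>> selected;
    selected.reserve(indices.size());
    for (std::size_t i : indices) {
        assert(i < weights.size());
        selected.push_back(weights[i]);
    }
    return selected;
}
```

With scores `{0.1, 0.7, 0.05, 0.15}` and `k = 2`, `top_k_experts` selects experts 1 and 3, and `gather_experts` returns only those two rows. The selection is expressed as data (indices) rather than control flow, which is the property that lets it run inside a single graph.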

NPUW config:

{
	"NPUW_DEVICES" : "NPU",
	"MAX_PROMPT_LEN" : 1024,
	"NPUW_MOE_TOKEN_CHUNK_SIZE" : 0,
	"NPUW_LLM_GENERATE_MOE_HINT" : "DEVICE_ROUTED",
	"NPUW_F16IC" : "YES",
	"NPUW_LLM_OPTIMIZE_V_TENSORS" : "YES",
	"NPU_TURBO" : "YES",
	"NPUW_DUMP_SUBS" : "YES",
	"NPUW_DUMP_IO" : "NO",
	"NPU_COMPILER_TYPE" : "DRIVER"
}

Tickets:

@github-actions github-actions bot added the category: build (OpenVINO cmake script / infra), category: samples (OpenVINO Runtime Samples), category: NPU (OpenVINO NPU plugin), and category: NPUW (NPUW plugin) labels and removed the category: samples (OpenVINO Runtime Samples) label Jan 28, 2026
@intelgaoxiong intelgaoxiong force-pushed the xiong/gpt-oss_device_routed branch 2 times, most recently from d8f7978 to 10e6b84 Compare January 30, 2026 05:43
@intelgaoxiong intelgaoxiong changed the title [NPUW]DEVICE_ROUTED mode for MoE (GPT-OSS-20B) decoding on NPU. [NPUW]Optimize MoE (GPT-OSS-20B) TPS on NPU - DEVICE_ROUTED. Jan 30, 2026
@intelgaoxiong intelgaoxiong force-pushed the xiong/gpt-oss_device_routed branch from 10e6b84 to 97b9ea3 Compare January 31, 2026 02:55
@github-actions github-actions bot removed the category: build (OpenVINO cmake script / infra) label Jan 31, 2026
@intelgaoxiong intelgaoxiong marked this pull request as ready for review January 31, 2026 03:06
@intelgaoxiong intelgaoxiong requested review from a team as code owners January 31, 2026 03:06
@dmatveev dmatveev added this to the 2026.1 milestone Feb 1, 2026
@intelgaoxiong intelgaoxiong force-pushed the xiong/gpt-oss_device_routed branch 3 times, most recently from 3dacd25 to 270410d Compare February 3, 2026 05:20
@github-actions github-actions bot added the category: build (OpenVINO cmake script / infra) label Feb 3, 2026
@intelgaoxiong
Contributor Author

#33924 is included in this PR.
This PR should be merged after #33924.

@intelgaoxiong intelgaoxiong force-pushed the xiong/gpt-oss_device_routed branch from 270410d to 7099966 Compare February 4, 2026 01:54
Convert gather to 2D.

Gather before convert.

Keep gather indices as constant.

Use JustInferRequest for DEVICE_ROUTED mode.

Clean up transformations for DEVICE_ROUTED.

Update config for DEVICE_ROUTED: BEST_PERF + not cut LM head.

Refactor device routed transformation.

Refactor GatherTo2DGather.

Apply MoE defaults if not explicitly set in external config.

Collect MoE nodes in single loop.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
multiply considered.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
@intelgaoxiong intelgaoxiong force-pushed the xiong/gpt-oss_device_routed branch from 7099966 to 6fbaa79 Compare February 4, 2026 13:21
@intelgaoxiong
Contributor Author

Rebased.

@esmirno esmirno self-requested a review February 5, 2026 12:28
@AlexanderKalistratov AlexanderKalistratov left a comment

idk, looks fine to me.
But please wait for the others' reviews.

    std::dynamic_pointer_cast<ov::op::v0::Constant>(tile->input_value(1).get_node_shared_ptr());
if (repeats_const) {
    auto repeats_data = repeats_const->cast_vector<int64_t>();
    if (!repeats_data.empty() && repeats_data[0] > k_value) {

So is it possible that repeats_data[0] <= k_value?

} else {
    // Constant reshape - check if dim 0 is expert dimension
    auto shape_data = shape_const->cast_vector<int64_t>();
    if (nodes.num_experts > 0 && !shape_data.empty() &&

nodes.num_experts > 0 implicitly assumes that we find the Tile node first.

}
}

return nodes;

So the whole function relies on some non-obvious assumptions and on layer names.
Why did you prefer this over MatcherPass and pattern matching?

}
}

void transform_dynamic_reshapes(LayerNodes& nodes) {

Why are we doing this?
Does it help us later?

Labels

category: build (OpenVINO cmake script / infra), category: NPU (OpenVINO NPU plugin), category: NPUW (NPUW plugin)

3 participants